relative pose estimation
Point-PNG: Conditional Pseudo-Negatives Generation for Point Cloud Pre-Training
Mahendren, Sutharsan, Rahman, Saimunur, Koniusz, Piotr, Fernando, Tharindu, Sridharan, Sridha, Fookes, Clinton, Moghadam, Peyman
We propose Point-PNG, a novel self-supervised learning framework that generates conditional pseudo-negatives in the latent space to learn point cloud representations that are both discriminative and transformation-sensitive. Conventional self-supervised learning methods focus on achieving invariance, discarding transformation-specific information. Recent approaches incorporate transformation sensitivity by explicitly modeling relationships between original and transformed inputs. However, they often suffer from an invariant-collapse phenomenon, where the predictor degenerates into identity mappings, resulting in latent representations with limited variation across transformations. To address this, we propose Point-PNG that explicitly penalizes invariant collapse through pseudo-negatives generation, enabling the network to capture richer transformation cues while preserving discriminative representations. To this end, we introduce a parametric network, COnditional Pseudo-Negatives Embedding (COPE), which learns localized displacements induced by transformations within the latent space. A key challenge arises when jointly training COPE with the MAE, as it tends to converge to trivial identity mappings. To overcome this, we design a loss function based on pseudo-negatives conditioned on the transformation, which penalizes such trivial invariant solutions and enforces meaningful representation learning. We validate Point-PNG on shape classification and relative pose estimation tasks, showing competitive performance on ModelNet40 and ScanObjectNN under challenging evaluation protocols, and achieving superior accuracy in relative pose estimation compared to supervised baselines.
Semi-distributed Cross-modal Air-Ground Relative Localization
Lu, Weining, Bin, Deer, Ma, Lian, Ma, Ming, Ma, Zhihao, Chen, Xiangyang, Wang, Longfei, Feng, Yixiao, Jiang, Zhouxian, Shi, Yongliang, Liang, Bin
Efficient, accurate, and flexible relative localization is crucial in air-ground collaborative tasks. However, current approaches for robot relative localization are primarily realized in the form of distributed multi-robot SLAM systems with the same sensor configuration, which are tightly coupled with the state estimation of all robots, limiting both flexibility and accuracy. To this end, we fully leverage the high capacity of Unmanned Ground Vehicle (UGV) to integrate multiple sensors, enabling a semi-distributed cross-modal air-ground relative localization framework. In this work, both the UGV and the Unmanned Aerial Vehicle (UAV) independently perform SLAM while extracting deep learning-based keypoints and global descriptors, which decouples the relative localization from the state estimation of all agents. The UGV employs a local Bundle Adjustment (BA) with LiDAR, camera, and an IMU to rapidly obtain accurate relative pose estimates. The BA process adopts sparse keypoint optimization and is divided into two stages: First, optimizing camera poses interpolated from LiDAR-Inertial Odometry (LIO), followed by estimating the relative camera poses between the UGV and UAV. Additionally, we implement an incremental loop closure detection algorithm using deep learning-based descriptors to maintain and retrieve keyframes efficiently. Experimental results demonstrate that our method achieves outstanding performance in both accuracy and efficiency. Unlike traditional multi-robot SLAM approaches that transmit images or point clouds, our method only transmits keypoint pixels and their descriptors, effectively constraining the communication bandwidth under 0.3 Mbps. Codes and data will be publicly available on https://github.com/Ascbpiac/cross-model-relative-localization.git.
Systematic Comparison of Projection Methods for Monocular 3D Human Pose Estimation on Fisheye Images
Käs, Stephanie, Peter, Sven, Thillmann, Henrik, Burenko, Anton, Adrian, David Benjamin, Mack, Dennis, Linder, Timm, Leibe, Bastian
Fisheye cameras offer robots the ability to capture human movements across a wider field of view (FOV) than standard pinhole cameras, making them particularly useful for applications in human-robot interaction and automotive contexts. However, accurately detecting human poses in fisheye images is challenging due to the curved distortions inherent to fisheye optics. While various methods for undistorting fisheye images have been proposed, their effectiveness and limitations for poses that cover a wide FOV has not been systematically evaluated in the context of absolute human pose estimation from monocular fisheye images. To address this gap, we evaluate the impact of pinhole, equidistant and double sphere camera models, as well as cylindrical projection methods, on 3D human pose estimation accuracy. We find that in close-up scenarios, pinhole projection is inadequate, and the optimal projection method varies with the FOV covered by the human pose. The usage of advanced fisheye models like the double sphere model significantly enhances 3D human pose estimation accuracy. We propose a heuristic for selecting the appropriate projection model based on the detection bounding box to enhance prediction quality. Additionally, we introduce and evaluate on our novel dataset FISHnCHIPS, which features 3D human skeleton annotations in fisheye images, including images from unconventional angles, such as extreme close-ups, ground-mounted cameras, and wide-FOV poses, available at: https://www.vision.rwth-aachen.de/fishnchips
REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning
Shao, Liangjing, Chen, Benshuang, Zhao, Shuting, Chen, Xinrong
Real-time ego-motion tracking for endoscope is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscope. Firstly, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features and the joint feature from two adjacent observations are all extracted for prediction. Due to more correlation information in the channel dimension of the concatenated image, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two continuous frames. To extract more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the concatenated feature map at the end of the framework. At last, the absolute pose of endoscope is calculated based on relative poses. The experiment is conducted on three datasets of various endoscopic scenes and the results demonstrate that the proposed method outperforms state-of-the-art methods. Besides, the inference speed of the proposed method is over 30 frames per second, which meets the real-time requirement. The project page is here: remote-bmxs.netlify.app
Communication-Efficient Cooperative SLAMMOT via Determining the Number of Collaboration Vehicles
The SLAMMOT, i.e. simultaneous localization, mapping, and moving object (detection and) tracking, represents an emerging technology for autonomous vehicles in dynamic environments. Such single-vehicle systems still have inherent limitations, such as occlusion issues. Inspired by SLAMMOT and rapidly evolving cooperative technologies, it is natural to explore cooperative simultaneous localization, mapping, moving object (detection and) tracking (C-SLAMMOT) to enhance state estimation for ego-vehicles and moving objects. C-SLAMMOT could significantly upgrade the single-vehicle performance by utilizing and integrating the shared information through communication among the multiple vehicles. This inevitably leads to a fundamental trade-off between performance and communication cost, especially in a scalable manner as the number of collaboration vehicles increases. To address this challenge, we propose a LiDAR-based communication-efficient C-SLAMMOT (CE C-SLAMMOT) method by determining the number of collaboration vehicles. In CE C-SLAMMOT, we adopt descriptor-based methods for enhancing ego-vehicle pose estimation and spatial confidence map-based methods for cooperative object perception, allowing for the continuous and dynamic selection of the corresponding critical collaboration vehicles and interaction content. This approach avoids the waste of precious communication costs by preventing the sharing of information from certain collaborative vehicles that may contribute little or no performance gain, compared to the baseline method of exchanging raw observation information among all vehicles. Comparative experiments in various aspects have confirmed that the proposed method achieves a good trade-off between performance and communication costs, while also outperforms previous state-of-the-art methods in cooperative perception performance.
Multi-LED Classification as Pretext For Robot Heading Estimation
Carlotti, Nicholas, Nava, Mirko, Giusti, Alessandro
Our chosen model is a Vision-based relative pose estimation of peer robots is a Fully Convolutional Network (FCN) [8] that, given an image, key capability for many multi-robot applications [1]. While predicts two-dimensional maps: one for the state of each some approaches focused on handcrafted algorithms, based LED, one representing the model's belief about the robot on circles printed on paper sheets [2] or exploiting the image-space position, and one about its heading. We take LEDs geometry [3], these systems lacked adaptability to full advantage of the inductive bias of FCNs, in which the different environmental conditions. Recent solutions train value of a pixel in the output map depends on a limited deep neural networks (DNNs) on large amounts of data.
Trajectory Regularization Enhances Self-Supervised Geometric Representation
Wang, Jiayun, Yu, Stella X., Chen, Yubei
Self-supervised learning (SSL) has proven effective in learning high-quality representations for various downstream tasks, with a primary focus on semantic tasks. However, its application in geometric tasks remains underexplored, partially due to the absence of a standardized evaluation method for geometric representations. To address this gap, we introduce a new pose-estimation benchmark for assessing SSL geometric representations, which demands training without semantic or pose labels and achieving proficiency in both semantic and geometric downstream tasks. On this benchmark, we study enhancing SSL geometric representations without sacrificing semantic classification accuracy. We find that leveraging mid-layer representations improves pose-estimation performance by 10-20%. Further, we introduce an unsupervised trajectory-regularization loss, which improves performance by an additional 4% and improves generalization ability on out-of-distribution data. We hope the proposed benchmark and methods offer new insights and improvements in self-supervised geometric representation learning.
Gaussian-Sum Filter for Range-based 3D Relative Pose Estimation in the Presence of Ambiguities
Ahmed, Syed S., Shalaby, Mohammed A., Cossette, Charles C., Ny, Jerome Le, Forbes, James R.
Multi-robot systems must have the ability to accurately estimate relative states between robots in order to perform collaborative tasks, possibly with no external aiding. Three-dimensional relative pose estimation using range measurements oftentimes suffers from a finite number of non-unique solutions, or ambiguities. This paper: 1) identifies and accurately estimates all possible ambiguities in 2D; 2) treats them as components of a Gaussian mixture model; and 3) presents a computationally-efficient estimator, in the form of a Gaussian-sum filter (GSF), to realize range-based relative pose estimation in an infrastructure-free, 3D, setup. This estimator is evaluated in simulation and experiment and is shown to avoid divergence to local minima induced by the ambiguous poses. Furthermore, the proposed GSF outperforms an extended Kalman filter, demonstrates similar performance to the computationally-demanding particle filter, and is shown to be consistent.
Multi-Robot Relative Pose Estimation in SE(2) with Observability Analysis: A Comparison of Extended Kalman Filtering and Robust Pose Graph Optimization
Shin, Kihoon, Sim, Hyunjae, Nam, Seungwon, Kim, Yonghee, Hu, Jae, Kim, Kwang-Ki K.
In this study, we address multi-robot localization issues, with a specific focus on cooperative localization and observability analysis of relative pose estimation. Cooperative localization involves enhancing each robot's information through a communication network and message passing. If odometry data from a target robot can be transmitted to the ego robot, observability of their relative pose estimation can be achieved through range-only or bearing-only measurements, provided both robots have non-zero linear velocities. In cases where odometry data from a target robot are not directly transmitted but estimated by the ego robot, both range and bearing measurements are necessary to ensure observability of relative pose estimation. For ROS/Gazebo simulations, we explore four sensing and communication structures. We compare extended Kalman filtering (EKF) and pose graph optimization (PGO) estimation using different robust loss functions (filtering and smoothing with varying batch sizes of sliding windows) in terms of estimation accuracy. In hardware experiments, two Turtlebot3 equipped with UWB modules are used for real-world inter-robot relative pose estimation, applying both EKF and PGO and comparing their performance.